Abstract


Background: The outbreak of severe acute respiratory syndrome (SARS) caused a severe global 
epidemic in 2003 which led to hundreds of deaths and many thousands of hospitalizations. The virus 
causing SARS was identified as a novel coronavirus (SARS-CoV) and multiple genomic sequences 
have been revealed since mid-April, 2003. After a quiet summer and fall in 2003, the newly emerged 
SARS cases in Asia, particularly the latest cases in China, are reinforcing a wide-spread belief that 
the SARS epidemic would strike back. With the understanding that SARS-CoV might be with 
humans for years to come, knowledge of the evolutionary mechanism of the SARS-CoV, including 
its mutation rate and emergence time, is fundamental to battle this deadly pathogen. To date, the 
speed at which the deadly virus evolved in nature and the elapsed time before it was transmitted 
to humans remains poorly understood. 


Results: Sixteen complete genomic sequences with available clinical histories during the SARS
outbreak were analyzed. After careful examination of multiple-sequence alignment, 114 single
nucleotide variations were identified. To minimize the effects of sequencing errors and additional
mutations during the cell culture, three strategies were applied to estimate the mutation rate by 1)
using the closely related sequences as background controls; 2) adjusting the divergence time for
cell culture; or 3) using the common variants only. The mutation rate in the SARS-CoV genome
was estimated to be 0.80 - 2.38 x 10^-3 nucleotide substitution per site per year, which is in the same
order of magnitude as other RNA viruses. The non-synonymous and synonymous substitution
rates were estimated to be 1.16 - 3.30 x 10^-3 and 1.67 - 4.67 x 10^-3 per site per year, respectively.
The most recent common ancestor of the 16 sequences was inferred to be present as early as the
spring of 2002.


Conclusions: The estimated mutation rates in the SARS-CoV using multiple strategies were not 
unusual among coronaviruses and moderate compared to those in other RNA viruses. All 
estimates of mutation rates led to the inference that the SARS-CoV could have been with humans 
in the spring of 2002 without causing a severe epidemic. 
Background

The earliest confirmed case of the severe acute respiratory 
syndrome (SARS) occurred in November, 2002 in the 
Guangdong province of China. Toward the end of the epidemic
(as reported by July 31, 2003) there were 8,098 recognized
cases in 31 countries or regions worldwide and
774 implicated deaths (WHO, http://www.who.int/csr/sars/country/table2003_09_23/en/).
Due to an unprecedented international effort, the SARS coronavirus
(SARS-CoV) was identified as the causal agent in late March 2003
and its first complete genomic sequences were published 
April 13, 2003 [1,2]. One month later, SARS-like corona- 
viruses were found in palm civets and other animals in 
Guangdong, China, the first evidence of possible interspe- 
cies transmission of the virus [3]. The re-emergence of the 
isolated SARS cases in Asia in December, 2003 and in 
Anhui province and Beijing, China, in late April 2004, has 
confirmed a wide-spread conjecture that the SARS-CoV 
will likely be with humans for years to come. This re- 
emergence of SARS cases makes it legitimate to critically 
re-evaluate the time for the origin of the SARS-CoV. 


There are 26 putative coding regions which cover about
98% of the 29.8-kb SARS-CoV genome. Approximately
two-thirds of the genome is at the 5' side, encoding the
nonstructural proteins (orf1ab and orf1a), and one-third
is at the 3' side, encoding four structural proteins: spike
glycoprotein (S), envelope (E), membrane (M), and
nucleocapsid (N) [4]. The spike glycoprotein, especially
its S1 subdomain, is responsible for binding to the specific
receptor in the target cells [4,5]. RNA polymerase and
nsp1 genes are two major loci in orf1ab.


Estimating the mutation rate in RNA viruses and retroviruses
is critical but also challenging for tracing their rapidly
evolving paths. The rates estimated from the positive-strand
ssRNA viruses appear to be in a similar range (e.g.,
~10^-3 per site per year) to those from the negative-strand ssRNA
viruses, although a direct comparison is not possible
because the mutation rates could be estimated from different
regions or genes [6-15]. The estimated mutation
rates in coronaviruses, to which SARS-CoV is phylogenetically
linked, are moderate to high compared to the others in
the category of ssRNA viruses. For example, it was esti- 
mated to be 0.3 - 0.6 x 10° per site per year in the infec- 
tious bronchitis virus in a previous study [8]. However, 
the estimated mutation rate appears to have a wider range 
in the retrovirus [16-20]. More details are presented in the 
Discussion section. 


How SARS-CoV evolves has important implications for
both strategic planning in the prevention of SARS epidemics
and development of a vaccine and antibodies. The
mutation rate is among the most fundamental aspects of
sequence evolution. If the pathogen evolves slowly, there
will be a better chance of developing effective long-lasting
vaccines, and a treatment that succeeds for patients from
a particular geographic region will likely be effective for
patients from other areas. On the other hand, if the pathogen
(particularly the genes coding for major antigens)
evolves rapidly, an effective strategy to prevent transmission
of the SARS-CoV must be the top priority, and an
effective vaccine program may be problematic. The purpose
of this study is to improve our understanding of the
evolutionary mechanism in the SARS-CoV genome, and
in particular to address the issues of the mutation rate and
the time of emergence of the SARS-CoV in the human
population. We report the estimated mutation rate in
the SARS-CoV using the available complete genomic
sequences whose clinical history either is certain or could
be inferred.


Results 

Mutation rate 

The sources of the genomic sequences used in this study
and the methods of estimating mutation rates are presented
in the Methods section. The divergence time was
inferred based on the information summarized in Figure
1. Table 1 shows the mutation rates estimated by three
strategies. When the first strategy was used to adjust for
sequencing errors and potential mutations in the cell culture,
the mutation rate was estimated to be 0.80 - 2.38 x
10^-3 nucleotide substitution per site per year using all the
sequences not generated from mainland China, and 0.81
- 1.38 x 10^-3 nucleotide substitution per site per year
using the TOR2 and Urbani sequences only. When the
second strategy was used, the mutation rate was estimated
to be 0.74 - 1.62 x 10^-3 nucleotide substitution per site per
year, which is lower than that from using the first strategy.
As expected, the mutation rate estimated using the third
strategy was the lowest: 0.54 - 1.57 x 10^-3 nucleotide substitution
per site per year using the 11 sequences not generated
from mainland China and 0.42 - 0.72 x 10^-3
nucleotide substitution per site per year using the TOR2
and Urbani sequences only.
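
As a rough illustration of how a range of divergence times translates into a range of rates, the sketch below divides a per-site distance by the time separating two isolates and rescales days to years. The function name, the count of two retained differences, and the use of the TOR2 genome length (29,751 bp) are assumptions made for this example rather than values taken from the analysis, and no multiple-hit correction is applied.

```python
DAYS_PER_YEAR = 365


def rate_per_site_per_year(n_diff, genome_len, days_apart):
    """Substitutions per site per year for one pair of isolates.

    days_apart is the total time separating the two sequences,
    i.e. twice the time back to their common ancestor.
    """
    if days_apart <= 0:
        raise ValueError("days_apart must be positive")
    per_site_distance = n_diff / genome_len
    return per_site_distance / days_apart * DAYS_PER_YEAR


# Hypothetical inputs: 2 retained differences, a 29,751-bp genome,
# and a separation of 34 to 58 days.
n_diff, genome_len = 2, 29751
upper = rate_per_site_per_year(n_diff, genome_len, 34)  # shortest time, highest rate
lower = rate_per_site_per_year(n_diff, genome_len, 58)  # longest time, lowest rate
print(f"{lower * 1e3:.2f} - {upper * 1e3:.2f} x 10^-3 per site per year")
```

With these illustrative inputs the range is 0.42 - 0.72 x 10^-3 per site per year. The shortest plausible separation gives the upper bound and the longest gives the lower bound, which is why each entry in Table 1 is reported as a range.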


Substitution rate in the coding regions 

For all samples, the proportion of non-synonymous substitutions
per non-synonymous site (Ka) was 0.63 x 10^-3
and the proportion of synonymous substitutions per synonymous
site (Ks) was 0.65 x 10^-3, leading to Ka/Ks being
0.97. This ratio was 0.79 in the nonstructural region and
1.37 in the structural region. In particular, the values of
Ka/Ks were 1.98 for nsp1 and 0.85 for S.
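
The overall ratio can be checked directly from the two proportions quoted above. This is a minimal sketch with a hypothetical helper name. The usual reading of the ratio (below 1 suggests purifying selection, above 1 suggests positive selection, near 1 suggests little net constraint) is the standard convention, not a conclusion drawn in this study.

```python
def ka_ks_ratio(ka, ks):
    """Return Ka/Ks, or None when Ks is zero and the ratio is undefined."""
    if ks == 0:
        return None
    return ka / ks


# Ka and Ks as quoted in the text; the shared scale factor cancels.
overall = ka_ks_ratio(0.63, 0.65)
print(round(overall, 2))
```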


Table 2 shows the rates of nucleotide substitution in the
coding regions of sequences. The overall rates of non-synonymous
and synonymous substitutions were 1.16 - 3.30
x 10^-3 and 1.67 - 4.67 x 10^-3 per site per year, respectively.
Figure 1: Clinical relations and estimated range of the divergence time among 16 SARS-CoV isolates. This figure is adapted
from Figure 5 in [4]. Solid arrows indicate the certain SARS coronavirus transmission route and dashed lines indicate the
uncertain route. SINxxxx denotes an unavailable primary contact of the Singaporean index patient (SIN2500). The numbers
indicate the range of the divergence time (days) between two isolates.


The non-synonymous rate was higher in the three genes E, 
M, and N, suggesting some of those mutations might 
increase antigenicity, although the number of mutations 
used to calculate these rates was small. 


Time for the origin of SARS-CoV 

The mutation rate estimated earlier allowed us to estimate
the age of the most recent common ancestor (MRCA) of
the sample, which should be about the same as or more
recent than the time for the origin of SARS-CoV. The phylogeny
reconstructed by the neighbor-joining method
with mid-point rooting or by maximum parsimony is
overall consistent with the epidemic (Additional file 1).
All the sequences from mainland China clustered together
and separated from the remaining sequences, including
those clinically related to the index patient A. GZ01 was
distantly separated from other sequences. Assuming the
MRCA is the root of the phylogeny, the age of the MRCA
is then the divergence time between GZ01 and other
sequences. Using the mutation rates estimated above, it is
found that the MRCA could have been alive at a time between
March 28 and November 29, 2002 (strategy 1), between
February 22 and October 3, 2002 (strategy 2), and even
earlier (strategy 3). The most critical implication of these
analyses is that it is entirely plausible that the MRCA of the
sample could have been alive as early as the spring of 2002.


Table 1: Mutation rate (per site per year).

            TOR2-Urbani                11 sequences
            t (days)   μ (x 10^-3)     t (days)    μ (x 10^-3)
Method 1    34-58      0.81-1.38       25.1-70.4   0.80-2.38
Method 2    48-72      0.85-1.28       37.4-78.6   0.74-1.62
Method 3    34-58      0.42-0.72       23.4-64.6   0.54-1.57

In method 1, the nucleotide difference (3.2) among five Singaporean sequences was used to adjust the sequence errors and mutations that occurred
during cell culture. In method 2, the number of variants between two sequences was reduced by 2 and the divergence time was increased by 14
days. In method 3, the nucleotide variants that were observed only once among the isolates were excluded. t = range of divergence time (days).
μ = mutation rate (per site per year).


Table 2: Substitution rates (x 10^-3 per site per year) and Ka/Ks ratio in the coding regions.

                       Non-synonymous sites   Synonymous sites   Ka/Ks
Total                  1.16-3.30              1.67-4.67          0.70
Nonstructural region   0.81-2.40              1.78-5.07          0.46
Structural region      2.03-5.53              1.40-3.69          1.47
Nsp1                   1.05-3.13              0.85-2.60          1.22
S                      1.11-3.02              3.22-8.50          0.35
EMN                    3.35-9.22              0                  >> 1

The same divergence time as in Table 1 was used. Nonstructural region denotes the 5' two-thirds of the coding regions (sites 265 - 21485) and
structural region denotes the 3' one-third of the coding regions (21492 - 29388). EMN denotes three genes E, M, and N.
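
The dating of the MRCA in this section amounts to converting a genetic distance into elapsed time and counting back from a sampling date. The sketch below shows only that arithmetic. The function name, the sampling date, the distance, and the rate are hypothetical values chosen for illustration, and the calculation assumes a constant rate along both lineages.

```python
from datetime import date, timedelta


def mrca_date(sample_date, pairwise_distance, rate_per_site_per_year):
    """Date of the common ancestor of two lineages sampled at sample_date.

    pairwise_distance is in substitutions per site and accumulates along
    both lineages, hence the factor of 2 in the denominator.
    """
    age_years = pairwise_distance / (2 * rate_per_site_per_year)
    return sample_date - timedelta(days=round(age_years * 365))


# Hypothetical inputs chosen only to show the arithmetic.
example = mrca_date(date(2003, 4, 1), 2.0e-3, 1.0e-3)
print(example)
```

A higher assumed rate moves the inferred date later and a lower rate moves it earlier, which is why the lower-bound rate from the third strategy gives the earliest dates.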


Discussion 

Some uncertainties in the quality of the sequence data and 
incomplete information from patient histories are two 
limiting factors of this study. The world-wide race to 
understand this novel virus has provided an unprece- 
dented set of complete genome sequences of a pathogen 
in an interval of a few weeks, but likely side-effects of this 
race might be an elevated error rate in the released 
sequences and generating errors during the analysis. 
Among the 129 sequence variations reported [4], many 
were generated randomly by the algorithms during the 
alignment of the multiple sequences, therefore these 
should be removed or adjusted. The concern above
led us to wait until all the sequences used in this study
had been significantly revised by their generators and to
manually adjust the multiple-sequence alignment. Still,
some errors were unavoidable, partly due to the intrinsic
error rate of sequencing technology. For example, among
18 common variations, 9 could not be uniquely assigned
to the internal branches of the phylogeny. This incongruence
is likely partially due to sequence errors. The existence
of sequence errors can also be inferred by examining
the ratio of transitional versus transversional changes. If
nucleotide substitution occurs randomly, there are two 
transversional substitutions on average for each transi- 
tional substitution, and the ratio of transition to transver- 
sion should be 0.5. However, transition is generally 
favored over transversion in many organisms. For exam- 
ple, the ratio is approximately 2 in the human genome 
[21,22]. The ratio has not been discussed extensively in 
the RNA viruses; however, it appears to be higher than 
that in the mammalian genomes based on the two previ- 
ous reports of 3.7 in the influenza A virus [23] and 5.0 in 
the Marburg virus [24]. In this study, 60 transitional sub- 
stitutions and 54 transversional substitutions were 
observed among the 16 sequences, thus the ratio was 1.1. 
The ratio in five sequences from mainland China was 0.9, 
considerably smaller than 2.2 which was observed in the 
other eleven sequences. This suggests that sequences from 
mainland China may be more erroneous than the other 
sequences. On the other hand, the ratio was 0.9 for the
singleton variations, which was much lower than the ratio
of 3.5 for the non-singleton variants. This further indicates
that singletons were more problematic.


http://www.biomedcentral.com/1471-2148/4/21


Organism                                             Mutation rate                             Ref.
ssRNA positive-strand viruses (coronaviruses)
  Mouse hepatitis virus                              0.44 - 2.77 x 10^-2 per site per year     [6]
  Transmissible gastroenteritis virus                0.7 x 10^-3 per site per year             [7]
  Infectious bronchitis virus                        0.67 - 1.33 x 10 per site per year        [8]
ssRNA positive-strand viruses (non-coronaviruses)
  Hepatitis C virus                                  0.82 x 10^-3 per site per year            [9]
  GBV-C/HGV                                          3.9 x 10^-3 per site per year             [10]
  Foot-and-mouth disease virus                       6 x 10 per site per year                  [11]
ssRNA negative-strand viruses
  Influenza A virus                                  2.28 x 10^-3 per site per year            [12]
                                                     2.3 x 10^-3 per site per year             [13]
  Infectious salmon anaemia virus                    0.96 x 10^-3 per site per year            [14]
  Measles virus                                      0.9 x 10^-4 per site per generation       [15]
Retroviruses
  HIV-1                                              1.7 x 10^-3 per site per year             [16]
                                                     1.62 x 10^-2 per site per year            [17]
  SIVagm virus                                       0.4 - 7.2 x 10^-2 per site per year       [18]
  Bovine leukemia virus                              4.8 x 10^-6 per site per generation       [19]
  Human T-cell leukemia virus                        1.2 x 10° per site per generation         [19]
  Visna virus                                        1.7 x 10^-3 per site per year             [20]
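
The transition/transversion argument in this part of the Discussion can be reproduced in a few lines. The sketch below classifies substitutions with hypothetical helper names. It confirms that equally likely changes give a ratio of 0.5, and it recomputes the overall ratio from the 60 transitions and 54 transversions reported in the text. Sequences are written with T rather than U, as in the database records.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def ts_tv_ratio(substitutions):
    """Ratio of transitions to transversions among (from, to) pairs."""
    subs = list(substitutions)  # allow iterators to be counted twice
    ts = sum(1 for a, b in subs if is_transition(a, b))
    tv = sum(1 for a, b in subs if a != b and not is_transition(a, b))
    return ts / tv


# All 12 possible single-nucleotide changes, each taken as equally likely.
random_expectation = ts_tv_ratio(permutations("ACGT", 2))
# Counts reported for the 16 sequences.
observed = 60 / 54
print(random_expectation, round(observed, 1))
```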


Because of the unknown level of errors in the sequences,
a conservative approach to estimating the mutation rate
was taken. Three strategies were used to reduce the effect
of sequence errors, one being more aggressive than the
other two. The mutation rates estimated by the first two
strategies were quite similar. In the third strategy, all the
variants unique to a given isolate were excluded. Such a
strategy is very conservative because the number of singletons
is expected to be large in a rapidly expanding environment
(see below). Therefore the mutation rate was placed
in the range of 0.80 - 2.38 x 10^-3 nucleotide substitution
per site per year based on the 11 sequences used. This rate,
along with the rate of synonymous substitutions estimated
in this study, is close to that recently reported using
another approach [25]. In comparison to other coronaviruses,
this rate is lower than that in the mouse hepatitis
virus, similar to that in the transmissible gastroenteritis
virus, but higher than that in the infectious bronchitis
virus (Table 3) [6-8]. The estimated mutation rate is of the
same order of magnitude as in other RNA viruses, for
example, 2.3 x 10^-3 nucleotide substitution per site per
year in the influenza A viruses [12,13]. The estimated
mutation rate in HIV appears to have a wide range
[16,17]. It is likely that the mutation rate in the SARS-CoV
is not higher than that in HIV. Therefore, the SARS-CoV is
not an unusual coronavirus or RNA virus in terms of its
speed of nucleotide changes. One of the challenging tasks,
therefore, is to find those variations which make the
SARS-CoV distinct from other RNA viruses, especially
coronaviruses, and how those variations changed
the functionality and helped to transmit it to humans.


Nucleotide variation is distributed along the entire 
genome. Based on our alignment and the annotation in 
GenBank, 21 of the 26 open reading frames had the vari- 
ations, including genes encoding polymerase, spike glyco- 
protein, envelope, membrane, and nucleocapsid protein. 
The estimated mutation rate suggests that approximately
2 to 6 new mutations will occur each month in a viral genome,
assuming a uniform mutation rate overall. However,
the rate of the non-synonymous substitutions might vary 
during the course of the SARS-CoV evolution [25]. It was 
observed that there was an excess of mutations (and 
amino acid changes) in the external branches of the phyl- 
ogeny of a large sample of the HA gene sequences of influ- 
enza A, which was partially caused by sampling bias [26]. 
From a population genetics standpoint, a large proportion 
of mutations should occur in the external branches when 
the infected hosts have rapidly increased. Therefore, one 
should not conclude that mutation rate is low because of 
a relatively small number of mutations in the internal 
branches [27]. Our analysis, even by a conservative esti- 
mation of mutation rate, indicates that the SARS-CoV 
population has already harbored a considerable amount 
of genetic diversity. 
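
The figure of roughly 2 to 6 new mutations per month quoted in this paragraph follows from multiplying the per-site rate by the genome length and dividing by twelve. The sketch below repeats that arithmetic. The helper name is hypothetical, and taking the TOR2 length of 29,751 bp as the genome size is an assumption made for this example.

```python
def mutations_per_month(rate_per_site_per_year, genome_len):
    """Expected new substitutions per genome per month at a uniform rate."""
    return rate_per_site_per_year * genome_len / 12


GENOME_LEN = 29751  # TOR2 length; assumed here as a representative genome size
low = mutations_per_month(0.80e-3, GENOME_LEN)
high = mutations_per_month(2.38e-3, GENOME_LEN)
print(round(low, 1), round(high, 1))
```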
Name        Accession ID   Version   Length (bp)   First release date   Last release date
TOR2        AY274119       3         29751         14-Apr-03            16-May-03
Urbani      AY278741       1         29727         21-Apr-03            12-Aug-03
CUHK-W1     AY278554       2         29736         18-Apr-03            31-Jul-03
CUHK-Su10   AY282752       1         29736         07-May-03            07-May-03
HKU-39849   AY278491       2         29742         18-Apr-03            29-Aug-03
SIN2500     AY283794       1         29711         09-May-03            12-Aug-03
SIN2677     AY283795       1         29705         09-May-03            12-Aug-03
SIN2679     AY283796       1         29711         09-May-03            12-Aug-03
SIN2748     AY283797       1         29706         09-May-03            12-Aug-03
SIN2774     AY283798       1         29711         09-May-03            09-May-03
TW1         AY291451       1         29729         14-May-03            14-May-03
BJ01        AY278488       2         29725         21-Apr-03            01-May-03
BJ02        AY278487       3         29745         21-Apr-03            05-Jun-03
BJ03        AY278490       3         29740         21-Apr-03            05-Jun-03
BJ04        AY279354       2         29732         23-Apr-03            05-Jun-03
GZ01        AY278489       2         29757         21-Apr-03            18-Aug-03


Based on the information in National Center for Biotechnology Information http://www.ncbi.nlm.nih.gov/ on August 31, 2003.


The time of emergence of the SARS-CoV is of special importance
in dissecting the origin of the virus as well as the
dynamics of the epidemic. The time for the most recent
common ancestor of the 16 isolates was estimated to be
between February 2002 and November 2002. Although
this is consistent with the date for the earliest known case
of SARS and those estimated in other studies [25,28], it
also suggests that SARS-CoV could have been present
earlier than the generally believed date of around November
2002. One possible scenario is that the SARS-CoV had
already infected some people in the spring of 2002 but
failed to cause epidemics; its spread was then suppressed
in the summer (similar to the summer of 2003),
and it re-emerged around November to cause the epidemic
in 2003. Given the current re-emergence of SARS cases,
this scenario is becoming more likely. There were indeed
some media reports of SARS-like symptoms of patients in
the spring of 2002, although none have been convincingly
confirmed. An alternative scenario is that the common
ancestor of the SARS-CoV lived in the spring of 2002, but
the host was an animal. The recent finding of high sequence
homology between the isolate from a newly emerged
SARS case (December 16, 2003) and the isolates from the
masked palm civets [29] makes civets the primary suspected
reservoir for SARS-CoV.


Conclusions 

The estimated mutation rate and the synonymous and
non-synonymous substitution rates in the SARS-CoV
genome were moderate compared to those in other coronaviruses
and other RNA viruses, suggesting that the SARS-CoV is
not an unusual coronavirus in terms of its speed of nucleotide
or amino acid changes. Based on the mutation rates
estimated in this study, the time of emergence of the most
recent common ancestor of the 16 isolates can be placed
between February 2002 and November 2002. This suggests
that the SARS-CoV could have been with humans as
early as the spring of 2002 without causing a severe epidemic.


Methods 

Sequence data 

We obtained 16 complete genomic sequences from the 
NCBI website http://www.ncbi.nlm.nih.gov/. Among 
them, five sequences (BJ01-04 and GZ01) were obtained 
from the hosts collected in mainland China and the 
remaining sequences (TOR2, Urbani, CUHK-W1, CUHK-Su10,
HKU-39849, five Singaporean sequences, and
TW1) were from the hosts in other geographic regions. 
Detailed information of the sequences is shown in Table 
4. 


Sequence analysis 

CLUSTAL X [30], a window-based user interface to
CLUSTAL W, was used to align the multiple sequences.
The alignment was further manually examined and 
adjusted. All gene annotation information and nucleotide 
position designations in this study refer to the TOR2 
sequence (GenBank accession ID: NC_004718). To avoid 
complications, only the single nucleotide variations were 
analyzed and all alignment gaps were excluded. This led 
to the identification of a total of 114 single nucleotide var- 
iations among all the sequences and an average of 18.2 
nucleotide differences between two sequences. 
The MEGA2 computer program [31] was used to calculate
the pair-wise nucleotide differences. The resulting genetic 
distances were corrected by Jukes and Cantor's method 
[32]. The phylogeny of the sample was reconstructed 
using both neighbor-joining and maximum parsimony 
methods [31,33]. 
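
For reference, the Jukes and Cantor correction mentioned above converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below applies it to the average of 18.2 differences between two sequences reported in this section. The function name is hypothetical, and dividing by the TOR2 length of 29,751 bp is an assumption made for this example.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# Average pairwise difference from the text over an assumed genome length.
p = 18.2 / 29751
d = jukes_cantor(p)
print(f"p = {p:.6e}, d = {d:.6e}")
```

At divergences this small the correction changes the distance only marginally, so the uncorrected and corrected values are almost identical.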


Mutation rate can be estimated in principle by the number
of nucleotide differences between two sequences divided
by twice their divergence time, i.e., the time to their most
recent common ancestor. Due to better documented contact
histories, mutation rates were estimated only by the
sequences whose hosts were not from mainland China,
that is, sequences TOR2, Urbani, CUHK-W1, CUHK-Su10,
HKU-39849, five Singaporean sequences, and TW1.
First, the range of the divergence time between each pair
of sequences was inferred based on information on infection
history, reported strain isolation dates and sequence
release dates (Additional file 2) [4,34-36]. For example,
the divergence time between isolates TOR2 and Urbani
was estimated to be in the range of 34 to 58 days [35,36].
Second, nucleotide difference between each pair of
sequences was calculated with adjustments to reduce the
effect of sequencing errors and potential mutations during
cell culture. Three strategies were used. The first strategy
reduced the number of pair-wise nucleotide
differences by the average number of nucleotide differences
observed in five closely related Singaporean
sequences [4]. This strategy effectively assumes that there
is no real nucleotide difference among these five
sequences so that their observed differences reflect the
level of errors. The second strategy reduced the
pair-wise nucleotide difference by two and added 7 days
to the divergence time to account for cell culture time.
This strategy assumes that the mutation rate during the
cell culture is the same as that in the human host and that
on average the sequencing error is one nucleotide per
genome. In the third strategy, we excluded all the nucleotide
variants which had been observed only once (singletons)
among the 61 human SARS-CoV sequences reported
in [25]. The rationale is that non-singleton mutations
observed in a sample are much less likely to be due to sequencing
errors or to mutations during the laboratory
passage of virus. This strategy is apparently conservative
and can be regarded as the lower bound of the mutation
rate. Finally, the mutation rate per site per year was estimated
by

where d_ij is the genetic distance between sequences i and j,
t_ij is twice their divergence time (in number of days), and
n is the number of sequences.
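
One form consistent with the definitions of d_ij, t_ij, and n given here is the average of the pairwise rates over all n(n-1)/2 pairs, converted from days to years. It is offered as an assumption, not as the authors' exact expression:

```latex
\mu \approx \frac{2}{n(n-1)} \sum_{i<j} \frac{365\, d_{ij}}{t_{ij}}
```

For a single pair (n = 2) this reduces to 365 d_12 / t_12, the pairwise distance divided by twice the divergence time and rescaled to years.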


A mutation in a codon is non-synonymous (or non- 
silent) if it changes the amino acid, and is synonymous 
(silent) otherwise. The number of non-synonymous 
mutations per non-synonymous site (Ka) and the number 
of synonymous mutations per synonymous site (Ks) were 
computed using the method of Li, Wu, and Luo [37]. The 
non-synonymous and synonymous substitution rates 
were calculated using the divergence time as estimated 
above. Only the second strategy was applied to the rate
estimation because the number of nucleotide differences
used for the adjustment in the first strategy cannot be
separated into non-synonymous and synonymous mutations.